Abstract

Common methods of causal inference generate directed acyclic graphs (DAGs) that
formalize causal relations between n variables. Given the joint distribution of all these
variables, the DAG contains all information about how intervening on one variable would
change the distribution of the other n - 1 variables. It remains, however, a non-trivial
question how to quantify the causal influence of one variable on another one.

Here we propose a measure of causal strength that refers to direct effects and measures
the "strength of an arrow" or of a set of arrows. It is based on a hypothetical intervention
that modifies the joint distribution by cutting the corresponding edge. The causal strength
is then the relative entropy distance between the old and the new distribution.

We discuss other measures of causal strength like the average causal effect, transfer
entropy and information flow and describe their limitations. We argue that our measure
is also more appropriate for time series than the known ones.
Finally, we discuss conceptual problems in defining the strength of indirect effects.

1 Introduction

Inferring causal relations is among the most important scientific goals since causality, as
opposed to mere statistical dependences, provides the basis for reasonable human decisions.
During the past decade, it has become popular to phrase causal relations in directed acyclic
graphs (DAGs) [1] with random variables (formalizing statistical quantities after repeated
observations) as nodes and causal influences as arrows.

We briefly explain this formal setting. Any system in which there are no two-way
interactions (neither direct nor indirect) can be formalized as a DAG. Let $G$ be such a
causal DAG with nodes $X_1, \dots, X_n$. To simplify notation, we will mostly assume the $X_j$
to be discrete. $P(x_1, \dots, x_n)$ denotes the probability mass function of the joint
distribution $P(X_1, \dots, X_n)$. By replacing sums with integrals, one can get a straightforward
generalization to continuous variables. If $PA_j$ denotes the set of parent variables of $X_j$
(i.e., its direct causes) in $G$, the joint probability factorizes into

$$P(x_1, \dots, x_n) = \prod_{j=1}^{n} P(x_j \mid pa_j), \tag{1}$$

where $pa_j$ denotes the values of $PA_j$. This factorization is implied [5] by the Markov
condition stating that every node $X_j$ is conditionally independent of its non-descendants,
given its parents. According to the Causal Markov Condition, which we take for granted
in this paper, DAGs are only considered as possible causal DAGs if they render the joint
distribution Markovian [3, 1]. Here and throughout the paper, we implicitly assume
causal sufficiency, i.e., there are no hidden variables that influence more than one of the $n$
observed variables. Moreover, by slightly abusing the notion of conditional probabilities,
we assume that $P(x_j \mid pa_j)$ is also defined for those $pa_j$ with $P(pa_j) = 0$. This means that
we also know how the causal mechanisms act on potential combinations of values of the
parents that never occur.

Given this formalism, one may wonder about the motivation for defining causal strength.
After all, the DAG together with the joint distribution contains the complete causal
information: one can easily compute how the joint distribution changes when an external
intervention sets some of the variables to specific values [1]. However, describing causal
relations in nature by a DAG always requires deciding how detailed the description should
be. Depending on the desired precision, one may or may not want to account for some weak
causal links. This motivates the need for an objective criterion for which arrows are
considered weak.

We first discuss some definitions of causal strength that are either known or just come 
up as straightforward ideas. 

Average causal effect: Following [1], $P(Y \mid do\, x)$ denotes the distribution of $Y$ when
$X$ is set to the value $x$ (it will be introduced more formally in eq. (5)). Note that it
only coincides with the usual conditional distribution $P(Y \mid x)$ if the statistical dependence
between $X$ and $Y$ is due to a direct influence of $X$ on $Y$, with no confounding contribution
of some latent common cause. If all $X_i$ are binary variables, causal strength can then be
quantified by the Average Causal Effect [4, 1]

$$ACE(X_i \to X_j) := P(X_j = 1 \mid do\, X_i = 1) - P(X_j = 1 \mid do\, X_i = 0).$$

For real-valued variables $X_j$ that are affected by a binary variable $X_i$, the analogous
quantity is the shift of the mean of $X_j$ that is caused by switching $X_i$ from 0 to 1.
Formally, one considers the difference [5]

$$E(X_j \mid do\, X_i = 1) - E(X_j \mid do\, X_i = 0).$$

This measure only accounts for the linear aspect of an interaction.

Analysis of Variance (ANOVA): Let $X_i$ be caused by $X_1, \dots, X_{i-1}$. Without any
assumptions, the variance of $X_i$ can formally be split into the average of the variances of
$X_i$, given $X_k$, and the variance of the expectations of $X_i$, given $X_k$:

$$Var(X_i) = E(Var(X_i \mid X_k)) + Var(E(X_i \mid X_k)).$$

Within the common scenario of drug testing experiments, for instance, the first term
describes the variability of $X_i$ within a group of equal treatments (i.e., fixed $X_k$), while
the second one describes how much the means of $X_i$ vary between different treatments.
It is tempting to say that the latter describes the part of the total variation of $X_i$ that
is caused by the variation of $X_k$, but this is conceptually wrong for non-linear influences
and if there are statistical dependences between $X_k$ and the other parents of $X_i$ [6, 5].
For linear structure equations,

$$X_i = \sum_{j<i} \alpha_{ij} X_j + N_i \quad \text{with the } N_j \text{ being jointly independent},$$

and additionally assuming $X_k$ to be independent of the other parents of $X_i$, the second
term is given by $Var(\alpha_{ik} X_k)$, which indeed describes the amount by which the variance
of $X_i$ decreases when $X_k$ is set to a fixed value by intervention. In this sense,

$$r_{ik} := \frac{Var(\alpha_{ik} X_k)}{Var(X_i)}$$

is indeed the fraction of the variance of $X_i$ that is caused by $X_k$. By rescaling all $X_j$ such
that $Var(X_j) = 1$, we have $r_{ik} = \alpha_{ik}^2$. Then, the square of the structure coefficient itself
can be seen as a simple measure of causal strength.



(Conditional) Mutual information: The information of $X$ on $Y$ or vice versa is given by

$$I(X; Y) := \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}.$$

The information of $X$ on $Y$ or vice versa if $Z$ is given is defined by [7]

$$I(X; Y \mid Z) := \sum_{x,y,z} P(x, y, z) \log \frac{P(x, y \mid z)}{P(x \mid z) P(y \mid z)}. \tag{2}$$

There are situations where these expressions (with $Z$ describing some background
condition) can indeed be interpreted as measuring the strength of the arrow $X \to Y$. An
essential part of this paper will be devoted to describing the conditions under which this
makes sense and how to replace the expressions with other information-theoretic ones for
the other cases.

Granger causality: Quantifying causal influence between time series (for instance
between $(X_t)_{t \in \mathbb{Z}}$ and $(Y_t)_{t \in \mathbb{Z}}$) is special because one is interested in quantifying the effect
of all $(X_t)$ on all $(Y_{t+s})$. If we represent the causal relations by a DAG where every time
instant defines a separate pair of variables, then we ask for the strength of a set of arrows.
If the time instant $t$ just describes instances of the same variables $X, Y$, we leave the
regime of i.i.d. sampling.

Reduction of uncertainty of one variable after knowing the other is also the key idea of
several related methods for quantifying causal strength in time series. Granger causality in
its original formulation uses reduction of variance [8]. A non-linear information-theoretic
extension in the same spirit is transfer entropy [9]. The latter is basically a
conditional mutual information where each variable $X, Y, Z$ in (2) is replaced with an
appropriate set of variables.

We will discuss conceptual problems with the measures known to us and argue why 
our measure seems to be a better formalization of causal strength. 

2 Postulates for causal strength 

Let us first discuss the properties we would like a measure of causal strength to have. We 
present four properties that we consider reasonable. 

P0. Causal Markov condition and arrows with zero strength: The joint distribution
satisfies the Markov condition also after removing all arrows of zero strength.
Formally, this is equivalent to

$$\mathfrak{C}_{X \to Y} = 0 \;\Rightarrow\; I(X; Y \mid PA_Y^X) = 0,$$

where $PA_Y^X$ denotes the direct causes of $Y$ other than $X$.

P1. Mutual information: If the true causal DAG is given by $X \to Y$, then

$$\mathfrak{C}_{X \to Y} = I(X; Y).$$

P2. Locality: The strength of $X \to Y$ only depends on (1) how $Y$ depends on $X$ and
its other parents, and (2) the joint distribution of all parents of $Y$. Formally,
knowing $P(Y \mid PA_Y)$ and $P(PA_Y)$ is sufficient to compute $\mathfrak{C}_{X \to Y}$. For strictly positive
densities $P(y, pa_Y)$, this is equivalent to knowing $P(Y, PA_Y)$.

P3. Quantitative causal Markov condition: If there is an arrow from $X$ to $Y$, then
the causal influence of $X$ on $Y$ is greater than or equal to the conditional mutual
information between $Y$ and $X$, given all the other parents of $Y$, formally

$$\mathfrak{C}_{X \to Y} \geq I(X; Y \mid PA_Y^X).$$



Note that P0 follows from P3 because $Y \perp X \mid PA_Y^X$ implies that we can drop the
link $X \to Y$. We have started with P0 for didactic reasons. We do not claim that every
reasonable measure of causal strength should satisfy these postulates, but we now explain
why we consider them as natural. To this end, we also show that the implications for
simple DAGs make sense.

P0: If the purpose of our measure of causal strength is to quantify the relevance of an arrow,
then arrows of zero strength must be irrelevant. In particular, removing such an arrow
$X \to Y$ does not yield a DAG that is ruled out by the causal Markov condition. Since
$PA_Y^X$ is the set of parents of $Y$ in the simplified DAG $G'$, we obtain

$$Y \perp ND_Y \mid PA_Y^X, \tag{3}$$

where $ND_Y$ denotes the non-descendants of $Y$ in $G'$. Note that $X$ cannot be a descendant
of $Y$ in $G'$ because it has been a parent in the original DAG $G$. Thus, (3) implies

$$Y \perp X \mid PA_Y^X.$$

Hence, we conclude

$$I(Y; X \mid PA_Y^X) = 0. \tag{4}$$

P1: The mutual information actually measures the strength of statistical dependences.
Since all these dependences are generated by the influence of $X$ on $Y$ (and not by a
common cause or $Y$ influencing $X$), it makes sense to measure causal strength by strength
of dependences. Note that mutual information $I(X; Y) = H(Y) - H(Y \mid X)$ also quantifies
the variability in $Y$ that is due to the variability in $X$, see also A.3.

Mutual information versus channel capacity. Given the premise that causal strength
should be an information-like quantity, a natural alternative to mutual information is the
capacity of the information channel $x \mapsto P(Y \mid x)$, i.e., the maximum of the
mutual information $I_{Q(X)}(X; Y)$ over all input distributions $Q(X)$ of $X$ when keeping the
conditional $P(Y \mid X)$.

While $I(X; Y)$ quantifies the observable dependences, channel capacity quantifies
the strength of the strongest dependences that can be generated using the information
channel $P(Y \mid X)$. In this sense, $I(X; Y)$ quantifies the factual causal influence, while
channel capacity measures the potential influence. Channel capacity also accounts for the
impact of setting $x$ to values that rarely or never occur in the observations. However,
this sensitivity regarding effects of rare inputs can certainly be a problem for estimating
the effect from sparse data. We therefore prefer mutual information $I(X; Y)$ as it better
assesses to what extent the frequently observed changes in $X$ influence $Y$.
P2: Locality implies that we can ignore causes of $X$ when computing $\mathfrak{C}_{X \to Y}$, unless
they are at the same time direct causes of $Y$. Likewise, other effects of $Y$ are irrelevant.
Moreover, it does not matter how the dependences between the parents are generated
(which parent influences which one or whether they are effects of a common cause); we
only need to know their joint distribution with $X$.

Violations of locality would have paradoxical implications. For example, variable $Z$
should clearly be irrelevant in DAG a) in Figure 1. Otherwise, $\mathfrak{C}_{X \to Y}$ would depend on
the mechanism that generates the distribution of $X$, while we are actually concerned with
the information flowing from $X$ to $Y$ instead of the one flowing to $X$ from other nodes.
Likewise (see DAGs b) and c) in Figure 1), it is irrelevant whether $X$ and $Y$ have further
effects.



P3: The postulate quantitatively extends P0: The arrow $X \to Y$ is the only reason for
the conditional dependence $I(Y; X \mid PA_Y^X)$ being non-zero, hence it is natural to postulate
that its strength cannot be smaller than the dependence that it generates.

Figure 1: DAGs for which the (conditional) mutual information is a reasonable measure of
causal strength: For a) to c), our postulates imply $\mathfrak{C}_{X \to Y} = I(X; Y)$. For d) we will obtain
$\mathfrak{C}_{X \to Y} = I(X; Y \mid Z)$.

Note that ACE and ANOVA are already ruled out by P0. Consider a relation between
three binary variables $X, Y, Z$, where $Y = X \oplus Z$ with $X$ and $Z$ being unbiased and
independent. Then changing $X$ has no influence on the statistics of $Y$. Likewise, knowing
$X$ does not reduce the variance of $Y$. To satisfy P0, we would need modifications that
account for the fact that we do observe an influence of $X$ on $Y$ for each fixed value $z$
rather than marginalizing over $Z$. We will not consider ACE and ANOVA any further and
now discuss information-theoretic approaches.
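
The XOR example can be checked numerically. The following Python sketch is only an illustration and not part of the original analysis; the helper names `p_y1_do_x` and `cond_mutual_information` are hypothetical. It assumes that $X$ and $Z$ are independent fair coins and $Y = X \oplus Z$, and it computes the average causal effect of $X$ on $Y$ together with $I(X; Y \mid Z)$ in bits.

```python
import itertools
import math

# Joint distribution P(x, z, y) for Y = X XOR Z with X, Z independent fair coins.
joint = {(x, z, x ^ z): 0.25 for x, z in itertools.product((0, 1), repeat=2)}

def p_y1_do_x(x):
    """P(Y=1 | do X=x). X and Z are independent root nodes in this example
    (an assumption), so we average P(Y=1 | x, z) over P(z) = 1/2."""
    return sum(0.5 * float((x ^ z) == 1) for z in (0, 1))

def cond_mutual_information(joint):
    """I(X;Y|Z) in bits for a dict {(x, z, y): probability}."""
    p_z, p_xz, p_zy = {}, {}, {}
    for (x, z, y), p in joint.items():
        p_z[z] = p_z.get(z, 0.0) + p
        p_xz[(x, z)] = p_xz.get((x, z), 0.0) + p
        p_zy[(z, y)] = p_zy.get((z, y), 0.0) + p
    return sum(
        p * math.log2(p * p_z[z] / (p_xz[(x, z)] * p_zy[(z, y)]))
        for (x, z, y), p in joint.items() if p > 0
    )

ace = p_y1_do_x(1) - p_y1_do_x(0)
cmi = cond_mutual_information(joint)
print(f"ACE = {ace}, I(X;Y|Z) = {cmi} bits")
```

Under these assumptions the average causal effect vanishes although $X$ fully determines $Y$ once $Z$ is fixed, which is the behavior that P0 excludes.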

3 Problems of known definitions 

Our definition of causal strength is presented in Section 4. This section discusses problems
with three measures of causal strength: mutual information, transfer entropy [9] and
information flow [10].

3.1 Mutual information and conditional mutual information 

It is sufficient to consider a few simple DAGs to illustrate why mutual information and
conditional mutual information are not suitable measures of causal strength, despite the
fact that they arise in special cases (see also P1). Quantifying causal strength in the DAGs
of Figure 2 is already as difficult as the general case because the main challenge in defining
$\mathfrak{C}_{X \to Y}$ is that dependences between $X$ and other parents of $Y$ generate dependences
between $X$ and $Y$ that interfere with the influence of $X$ on $Y$.

Mutual information is not suitable in Figure 2 a). It is clear that $I(X; Y)$ is inappropriate
because part of the dependence is due to the common cause $Z$: $I(X; Y)$ can be non-zero,
and even arbitrarily large, when the arrow $X \to Y$ is missing. Moreover, DAG a) in Figure 2
contains DAG d) in Figure 1 as a limiting case (when the arrow $Z \to X$ gets weaker), where P3
requires causal strength to majorize the conditional information $I(X; Y \mid Z)$. Then causal
strength cannot be given by $I(X; Y)$ because it is easy to construct transition probabilities
for DAG d) where $I(X; Y) = 0$ and $I(X; Y \mid Z) \neq 0$.

Conditional mutual information is not suitable for Figure 2 a). Showing that $I(X; Y \mid Z)$ is,
nevertheless, inappropriate for DAG a) in Figure 2 is more subtle; after all, it qualitatively
behaves correctly in the sense that it vanishes when the arrow $X \to Y$ disappears. To see
the wrong quantitative behavior, we consider the limiting case where the influence
$Z \to Y$ gets weaker until it disappears completely, at which point we obtain DAG a) in
Figure 1. We have already discussed that our postulates imply $\mathfrak{C}_{X \to Y} = I(X; Y)$ and not
$I(X; Y \mid Z)$ for this case.

Note that for DAG a) in Figure 1, $I(X; Y)$ is larger than $I(X; Y \mid Z)$ because $Y \perp Z \mid X$
implies $I(X; Y) = I(Z; Y) + I(Y; X \mid Z)$ using standard properties of mutual information
[7]. This shows that $I(X; Y \mid Z)$ underestimates the causal strength at least for the limiting
case where $Z \to Y$ disappears completely. We will later see that $I(X; Y \mid Z)$ is also too
small for the case where $Z \to Y$ is strong enough to be relevant.

Figure 2: DAGs for which finding a proper definition of $\mathfrak{C}_{X \to Y}$ is challenging.

Figure 3: Left: Typical causal DAG for two time series with mutual causal influence. The
structure is acyclic because instantaneous influences are excluded. Right: counterexample
in [10]. Transfer entropy vanishes if all arrows are copy operations although the time series
strongly influence each other.



Both conditional and unconditional mutual information are unsuitable for Figure 2 b). The
reasons are similar to those for DAG a). $I(X; Y)$ would measure the overall effect of $X$
on $Y$, which is partly due to the arrow $X \to Y$ and partly due to the path over $Z$.
Conditional information $I(X; Y \mid Z)$ behaves qualitatively correctly in the sense that it
vanishes if the arrow $X \to Y$ disappears. However, it underestimates the strength of the
arrow. Consider the limiting case where the arrow $Z \to Y$ disappears, yielding DAG c)
in Figure 1. In the limit we wish to obtain $I(X; Y)$, which is again larger than $I(X; Y \mid Z)$
due to $Y \perp Z \mid X$ (see above).
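
The decomposition $I(X; Y) = I(Z; Y) + I(Y; X \mid Z)$ that holds whenever $Y \perp Z \mid X$ can be verified on a small example. The sketch below is an illustration only: it assumes a binary chain $Z \to X \to Y$ with made-up conditional probability tables, and the function names are hypothetical.

```python
import numpy as np

def mutual_information(p_ab):
    """I(A;B) in bits for a 2-D joint probability table."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log2(p_ab[mask] / (p_a * p_b)[mask])))

def conditional_mutual_information(p_xyz):
    """I(X;Y|Z) in bits for a table indexed as [x, y, z]."""
    total = 0.0
    for z in range(p_xyz.shape[2]):
        p_z = p_xyz[:, :, z].sum()
        if p_z > 0:
            total += p_z * mutual_information(p_xyz[:, :, z] / p_z)
    return total

# Hypothetical chain Z -> X -> Y, so that Y is independent of Z given X.
p_z = np.array([0.3, 0.7])
p_x_given_z = np.array([[0.9, 0.1], [0.2, 0.8]])    # rows: z, columns: x
p_y_given_x = np.array([[0.85, 0.15], [0.25, 0.75]])  # rows: x, columns: y

p_xyz = np.einsum("z,zx,xy->xyz", p_z, p_x_given_z, p_y_given_x)

i_xy = mutual_information(p_xyz.sum(axis=2))   # table over (x, y)
i_zy = mutual_information(p_xyz.sum(axis=0))   # table over (y, z); MI is symmetric
i_xy_given_z = conditional_mutual_information(p_xyz)
print(i_xy, i_zy + i_xy_given_z)
```

Since $I(Z; Y) \geq 0$, the identity makes explicit why the conditional quantity cannot exceed $I(X; Y)$ in this limiting case.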



3.2 Transfer entropy 

Transfer entropy [9] is intended to measure the influence of one time series on another
one. Let $(X_t, Y_t)_{t \in \mathbb{Z}}$ be a bivariate stochastic process where $X_t$ influences some $Y_s$ with
$s > t$, see Figure 3, left. Then transfer entropy is defined as the following conditional
mutual information:

$$I(X_{(-\infty,t-1]} \to Y_t \mid Y_{(-\infty,t-1]}) := I(X_{(-\infty,t-1]}; Y_t \mid Y_{(-\infty,t-1]}).$$

It measures the amount of information the past of $X$ provides about the present of $Y$
given the past of $Y$. To quantify causal influence by conditional information relevance
is also in the spirit of Granger causality, where information is usually understood in the
sense of the amount of reduction of the linear prediction error.

Transfer entropy is an unsatisfactory measure of causal strength. [10] have pointed out
that this information relevance fails to quantify causal influence for the following toy
model: Assume the information from $X_t$ is perfectly copied to $Y_{t+1}$ and the information
from $Y_t$ to $X_{t+1}$. Then the past of $Y$ is already sufficient to perfectly predict the present
value of $Y$ and the past of $X$ does not provide any further information. Therefore, transfer
entropy vanishes although both variables heavily influence each other.

Figure 4: Time series with only two causal arrows, where transfer entropy fails to satisfy
our postulates.

Transfer entropy violates our postulates. We now show that transfer entropy yields 0 bits
of causal influence in a situation where common sense and our postulates require that
causal strength is 1 bit. Since our postulates refer to the strength of a single arrow while
transfer entropy is supposed to measure the strength of all arrows from $X$ to $Y$, we reduce
the DAG such that there is only one arrow from $X$ to $Y$, see Figure 4. Then,

$$I(X_{(-\infty,t-1]} \to Y_t \mid Y_{(-\infty,t-1]}) = I(X_{(-\infty,t-1]}; Y_t \mid Y_{(-\infty,t-1]}) = I(X_{t-1}; Y_t \mid Y_{t-2}).$$

The causal structure coincides with DAG a) in Figure 1 by setting $Y_{t-2} = Z$, $X_{t-1} = X$,
and $Y_t = Y$. With these replacements, transfer entropy yields $I(X; Y \mid Z) = 0$ bits instead
of $I(X; Y) = 1$ bit, as required by P1.

Further critical discussion of transfer entropy can be found in [11] in the context of
cellular automata dynamics.

3.3 Information flow 

After arguing that transfer entropy does not properly capture the strength of the impact of
interventions, [10] proposes to define causal strength using Pearl's do calculus [1]. Given a
causal directed acyclic graph $G$, Pearl computes the joint distribution obtained if variable
$X_j$ is forcibly set to the value $x'_j$ as

$$P(x_1, \dots, x_n \mid do\, x'_j) := \prod_{i \neq j} P(x_i \mid pa_i) \cdot \delta_{x_j, x'_j}. \tag{5}$$

Given three sets of nodes $X_A$, $X_B$ and $X_C$ in a directed acyclic graph $G$, information flow
is computed as

$$I(X_A \to X_B \mid do\, X_C) := \sum_{x_C, x_A, x_B} P(x_C)\, P(x_A \mid do\, x_C)\, P(x_B \mid do\, x_A, do\, x_C) \log \frac{P(x_B \mid do\, x_A, do\, x_C)}{\sum_{x'_A} P(x'_A \mid do\, x_C)\, P(x_B \mid do\, x'_A, do\, x_C)}.$$

To better understand this expression, we first consider the case where the set $X_C$ is empty.
Then we obtain

$$I(X_A \to X_B) = \sum_{x_A, x_B} P(x_A)\, P(x_B \mid do\, x_A) \log \frac{P(x_B \mid do\, x_A)}{\sum_{x'_A} P(x'_A)\, P(x_B \mid do\, x'_A)},$$

which measures the mutual information between $X_A$ and $X_B$ obtained when the
information channel $x_A \mapsto P(X_B \mid do\, x_A)$ is used with the input distribution $P(X_A)$. By
using the post-interventional conditional distribution rather than the observed conditional
$x_A \mapsto P(X_B \mid x_A)$ as information channel, this definition certainly takes the right step from
a dependence measure to a measure of causal strength. One may ask why the input
$P(X_A)$ is used for this channel instead of some other distribution of $X_A$. Similar
questions will arise for our definition of causal strength, too. Here we accept $P(X_A)$ at least
as a straightforward choice, although others could be justified. The question about the
appropriate input gets even more important when $X_C$ is a non-empty set of "background"
variables (other causes of $X_B$, for instance). It is natural to consider then the information
channels $x_A \mapsto P(X_B \mid do\, x_A, do\, x_C)$ for different $x_C$ and average the resulting mutual
information over all $x_C$, weighted by $P(X_C)$. [10] decided to choose $P(X_A \mid do\, x_C)$ as input
distribution. Although this choice seems to be more in the spirit of describing causality
than the choice $P(X_A \mid x_C)$ would be, we should mention that it violates our postulate P2
as described below.

Information flow is an unsatisfactory measure of causal strength. To quantify $X \to Y$
in DAGs a) and b) in Figure 2 by information flow, we may either choose $I(X \to Y)$ or
$I(X \to Y \mid Z)$. We show that both choices are inconsistent with our postulates and with
our intuitive expectation. We start with $I(X \to Y)$ and DAG a). Let $X, Y, Z$ be binary,
where $Y := X \oplus Z$ is the XOR of its causes, $Z$ is an unbiased coin toss and $X$ is a faulty
copy of $Z$ with two-sided symmetric error. One easily checks that $I(X \to Y)$ is zero in
the limit of error probability 1/2 (making $X$ and $Z$ independent). Nevertheless, dropping
the arrow $X \to Y$ would violate the Markov condition, in contradiction to P0. For error
rate close to 1/2, we still violate P3 because $I(Y; X \mid Z)$ is close to 1, while $I(X \to Y)$ is
close to zero. A similar argument can be constructed for DAG b) in the same figure.

We now consider $I(X \to Y \mid Z)$. Note that it yields different results for DAGs a)
and b) if the joint distribution is the same, in contradiction to P2. This is because
$P(x \mid do\, z) = P(x \mid z)$ for a), while $P(x \mid do\, z) = P(x)$ for b). In other words, $I(X \to Y \mid Z)$
depends on what type of causal relation generated the dependences between the two causes
$X$ and $Z$. Apart from being inconsistent with our postulate, we find it unsatisfactory that
$I(X \to Y \mid Z)$ tends to zero for the example above if the error rate of copying $X$ from $Z$
in DAG a) tends to zero (conditioned on setting $Z$ to some value, the information passed
from $X$ to $Y$ is zero because $X$ attains a fixed value, too). In this limit, $Y$ is always zero.
We argue that the link $X \to Y$ still remains important for explaining the behavior of the
XOR: without the link, the gate could not always output "zero", for both values of $Z$.
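
The claims about $I(X \to Y)$ in the XOR example can be reproduced numerically. The sketch below is an illustration under stated assumptions: DAG a) is taken to have the arrows $Z \to X$, $Z \to Y$ and $X \to Y$, the error probability is called `eps`, and all function names are hypothetical. It evaluates information flow with an empty conditioning set and the conditional mutual information $I(X; Y \mid Z)$.

```python
import math

def joint_xor(eps):
    """P(z, x, y) for Z fair, X = Z flipped with probability eps, Y = X XOR Z."""
    joint = {}
    for z in (0, 1):
        for x in (0, 1):
            p = 0.5 * (eps if x != z else 1.0 - eps)
            if p > 0:
                joint[(z, x, x ^ z)] = p
    return joint

def information_flow_x_to_y(eps):
    """I(X -> Y) in bits, using P(X) as input of the channel x -> P(Y | do x)."""
    joint = joint_xor(eps)
    p_x = {x: sum(p for (_, xx, _), p in joint.items() if xx == x) for x in (0, 1)}
    # Under do(x), Z keeps its marginal distribution and Y = x XOR z.
    p_y_do_x = {(y, x): sum(0.5 for z in (0, 1) if (x ^ z) == y)
                for x in (0, 1) for y in (0, 1)}
    p_y = {y: sum(p_x[x] * p_y_do_x[(y, x)] for x in (0, 1)) for y in (0, 1)}
    return sum(
        p_x[x] * p_y_do_x[(y, x)] * math.log2(p_y_do_x[(y, x)] / p_y[y])
        for x in (0, 1) for y in (0, 1) if p_x[x] * p_y_do_x[(y, x)] > 0
    )

def cmi_x_y_given_z(eps):
    """I(X;Y|Z) in bits for the same model."""
    joint = joint_xor(eps)
    p_z, p_zx, p_zy = {}, {}, {}
    for (z, x, y), p in joint.items():
        p_z[z] = p_z.get(z, 0.0) + p
        p_zx[(z, x)] = p_zx.get((z, x), 0.0) + p
        p_zy[(z, y)] = p_zy.get((z, y), 0.0) + p
    return sum(p * math.log2(p * p_z[z] / (p_zx[(z, x)] * p_zy[(z, y)]))
               for (z, x, y), p in joint.items())

flow = information_flow_x_to_y(0.49)
cmi = cmi_x_y_given_z(0.49)
print(f"I(X -> Y) = {flow:.4f} bits, I(X;Y|Z) = {cmi:.4f} bits")
```

For an error rate near 1/2 the two numbers differ by almost a full bit, which is the violation of P3 described above.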

4 Defining the strength of causal arrows 

4.1 Definition in terms of conditional probabilities 

This section proposes a way to quantify the causal influence of a set of arrows that yields 
satisfactory answers in all the cases discussed above. Our measure is motivated by a 
scenario where the nodes represent different parties communicating with each other via 
channels. Hence, we think of arrows as physical channels that propagate information 
between distant points in space, e.g., wires that connect electronic devices. Each such 
wire connects the output of a device with the input of another one. For the intuitive 
ideas below, it is also important that the wire connecting Xi and Xj physically contains 
full information about Xi (which may be more than the information that is required to 
explain the output behavior $P(X_j \mid PA_j)$).

We then think of the strength of the arrow $X_i \to X_j$ as the impact of corrupting it,
i.e., the impact of cutting the wire. To get a well-defined "post-cutting" distribution we
have to say what to do with the open end corresponding to $X_j$, because it needs to be
fed with some input. It is natural to feed it probabilistically with inputs $x_i$ according to
$P(X_i)$ because this is the only distribution of $X_i$ that is locally observable (feeding it with
some conditional distribution $P(X_i \mid \dots)$ would assume that the one who cuts the edge
has access to other nodes and not only to the physical state of the channel). Likewise,
we define the deletion of a set of arrows by feeding all open ends with the product of the
corresponding marginal distributions. Also here we argue that feeding the open ends with
some non-product distribution would require communication between the different open
ends. Figure 5 visualizes the deletion of one edge (left) and two edges (right).

Figure 5: Left: deletion of the arrow $X \to Y$. The conditional $P(Y \mid X, Z)$ is fed
with an independent copy of $X$, distributed with $P(X)$. The resulting distribution reads
$P_{X \to Y}(x, y, z) = P(x, z) \sum_{x'} P(y \mid z, x') P(x')$. Right: deletion of both incoming arrows. The
conditional $P(Y \mid X, Z)$ is then fed with the product distribution $P(X)P(Z)$ instead of the
joint $P(X, Z)$ since the latter would require communication between the open ends. This
results in the distribution $P_{X \to Y, Z \to Y}(x, y, z) = \sum_{x', z'} P(x, z) P(y \mid x', z') P(x') P(z')$.

Remark 1. The communication scenario also motivates our choice of mutual information
rather than capacity in postulate P1. The capacity of $X \to Y$ cannot be computed locally
at $X$, since it requires that the intervener not only observes $X$ but also $Y$.

We now define the "post-cutting" distribution formally: 

Definition 1 (removing causal arrows). Let $G$ be a causal DAG and $P$ be Markovian
with respect to $G$. Let $S \subset G$ be a set of arrows. Set $PA_j^S$ as the set of those parents $X_i$
of $X_j$ for which $(i, j) \in S$ and $PA_j^{\bar{S}}$ those for which $(i, j) \notin S$. Set

$$P_S(x_j \mid pa_j^{\bar{S}}) := \sum_{pa_j^S} P(x_j \mid pa_j^{\bar{S}}, pa_j^S)\, P_\Pi(pa_j^S), \tag{6}$$

where $P_\Pi(pa_j^S)$ denotes for a given $j$ the product of marginal distributions of all variables
in $PA_j^S$. Define a new joint distribution, the interventional distribution,

$$P_S(x_1, \dots, x_n) := \prod_j P_S(x_j \mid pa_j^{\bar{S}}). \tag{7}$$

See Figure 5, left, for a simple example with cutting only one edge. Eq. (7) formalizes
the fact that each open end of the wires is independently fed with the corresponding
marginal distribution, see also Figure 5, right. The modified joint distribution $P_S$ can be
considered as generated by the reduced DAG:

Lemma 1 (Markovian). The interventional distribution $P_S$ is Markovian with respect to
the graph $G_S$ obtained from $G$ by removing the edges in $S$.

Proof. By construction, $P_S$ factorizes according to $G_S$ in the sense of (1). □

Definition 2 (causal influence of a set of arrows). The causal influence of the arrows in
$S$ is given by the Kullback-Leibler divergence

$$\mathfrak{C}_S(P) := D(P \,\|\, P_S). \tag{8}$$

If $S$ is a single edge $X_k \to X_l$, we write $\mathfrak{C}_{k \to l}$ instead of $\mathfrak{C}_{X_k \to X_l}$.

Note that $P_S$ could easily be confused with different distributions that we obtain
when the open ends are not fed with the marginal distributions but with conditional
distributions. We want to explain this for DAG a) in Figure 2. Define $\tilde{P}_{X \to Y}(X, Y, Z)$ by

$$\tilde{P}_{X \to Y}(x, y, z) := P(x, z)\, P(y \mid z) = P(x, z) \sum_{x'} P(y \mid x', z)\, P(x' \mid z),$$

and recall that replacing $P(x' \mid z)$ with $P(x')$ in the rightmost expression yields $P_{X \to Y}$. We
call $\tilde{P}_{X \to Y}$ the "partially observed distribution". It is the distribution that one erroneously
gets when ignoring the influence of $X$ on $Y$: $\tilde{P}_{X \to Y}$ is computed according to (1), but uses
a DAG where $X \to Y$ is missing. The difference between "ignoring" and "cutting" the
edge is important for the following reason. By a known rephrasing of mutual information
as relative entropy [7] we obtain

$$D(P \,\|\, \tilde{P}_{X \to Y}) = I(X; Y \mid Z), \tag{9}$$

which, as we have already discussed, is not in general a satisfactory measure of causal
strength.
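
For a node $Y$ with two parents $X$ and $Z$, Definitions 1 and 2 can be evaluated directly on probability tables. The following sketch is an illustration only: the function names `cut_x_to_y` and `kl_bits` are hypothetical, the tables are indexed as `[x, z, y]`, and the test case assumes $Y = X \oplus Z$ with independent fair coins $X$ and $Z$.

```python
import numpy as np

def cut_x_to_y(p_xz, p_y_given_xz):
    """Return (P, P_S) as arrays indexed [x, z, y] for S = {X -> Y}.

    p_xz[x, z] is the joint of the parents, p_y_given_xz[x, z, y] the mechanism.
    """
    p_x = p_xz.sum(axis=1)
    p = p_xz[:, :, None] * p_y_given_xz
    # Feed the open end with an independent copy x' drawn from P(X):
    # P_S(y | z) = sum_{x'} P(y | x', z) P(x').
    p_y_given_z_cut = np.einsum("a,azy->zy", p_x, p_y_given_xz)
    p_s = p_xz[:, :, None] * p_y_given_z_cut[None, :, :]
    return p, p_s

def kl_bits(p, q):
    """Relative entropy D(p || q) in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Example: Y = X XOR Z with X, Z independent and unbiased.
p_xz = np.full((2, 2), 0.25)
p_y_given_xz = np.zeros((2, 2, 2))
for x in (0, 1):
    for z in (0, 1):
        p_y_given_xz[x, z, x ^ z] = 1.0

p, p_s = cut_x_to_y(p_xz, p_y_given_xz)
strength = kl_bits(p, p_s)
print(f"causal strength of X -> Y: {strength} bits")
```

In this example the post-cutting conditional of $Y$ given $Z$ is uniform, so the divergence comes out as one bit, in line with $I(X; Y \mid Z) = 1$ bit as a lower bound required by P3.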

4.2 Definition via structure equations 

Our definition of $\mathfrak{C}_{i \to j}$ uses the conditional density $P(x_j \mid pa_j)$. Estimating a conditional
density from empirical data requires huge samples or strong assumptions (particularly for
continuous variables). Fortunately, however, structure equations (also called functional
models [1]) allow for a more direct estimation of causal strength without referring to the
conditional distribution.

Definition 3 (structure equation). A structure equation is a model that explains the joint
distribution $P(X_1, \dots, X_n)$ by a deterministic dependence

$$X_j = f_j(PA_j, E_j),$$

where the variables $E_j$ are jointly independent unobserved noise variables. Note that
functions $f_j$ that correspond to parentless variables can be chosen to be the identity, i.e.,
$X_j = E_j$.

Suppose that we are given a causal inference method that directly infers the structure
equations (e.g., [12, 13]) in the sense that it outputs n-tuples $(e_1^j, \dots, e_n^j)$ with $j = 1, \dots, m$
as well as the functions $f_j$ from the observed n-tuples $(x_1^j, \dots, x_n^j)$.

Definition 4 (removing a causal arrow in a structure equation). Deletion of the arrow
$X_k \to X_l$ is modeled by (i) introducing an i.i.d. copy $X'_k$ of $X_k$ and (ii) subsuming the
new random variable $X'_k$ into the noise term of $f_l$. The result is a new set of structure
equations:

$$x_j = f_j(pa_j, e_j) \quad \text{if } j \neq l, \text{ and}$$
$$x_l = f_l\big(pa_l \setminus \{x_k\}, (x'_k, e_l)\big), \tag{10}$$

where we have omitted the superscript $j$ to simplify notation.

Remark 2. For measuring the causal influence of a set of arrows, the procedure works
similarly: we then need to introduce jointly independent i.i.d. copies of all variables
at the tails of deleted arrows.

Remark 3. The change introduced by the deletion only affects $X_l$ and its descendants;
the virtual sample thus keeps all $x_j$ with $j < k$. Moreover, we can ignore all variables $X_j$
with $j > l$ due to Lemma 4.

Note that $x'_k$ must be chosen to be independent of all $x_j$ with $j < k$, but, by virtue
of the structure equations, not independent of $x_l$ and its descendants. The new structure
equations thus generate n-tuples of "virtual" observations $x_1^S, \dots, x_n^S$ from the input

$$(e_1, \dots, (x'_k, e_l), \dots, e_n).$$

We will show below that the n-tuples generated this way indeed follow the distribution
$P_S(X_1, \dots, X_n)$. We can therefore estimate causal influence via any method that
estimates relative entropy distance using the observed samples $x_1, \dots, x_n$ and the virtual
ones $x_1^S, \dots, x_n^S$. To illustrate the above scheme, we consider the case where $Z$ and $X$ are
causes of $Y$ and we want to delete the edge $X \to Y$. The case where $Y$ has more than 2
parents follows easily.

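Example 1 (Two parents). The following table corresponds to the observed variables $X, Z, Y$,
as well as the unobserved noise $E^Y$, which we assume to be estimated together with
learning the structural equations:

$$\left(\begin{array}{cccc}
Z & X & E^Y & Y \\
z_1 & x_1 & e_1^Y & f_Y(z_1, x_1, e_1^Y) \\
z_2 & x_2 & e_2^Y & f_Y(z_2, x_2, e_2^Y) \\
\vdots & \vdots & \vdots & \vdots \\
z_m & x_m & e_m^Y & f_Y(z_m, x_m, e_m^Y)
\end{array}\right) \tag{11}$$

To simulate the deletion of $X \to Y$ we first generate a list of virtual observations for
$Y$ after generating samples from an i.i.d. copy $X'$ of $X$:

$$\left(\begin{array}{ccccc}
Z & X & X' & E^Y & Y \\
z_1 & x_1 & x'_1 & e_1^Y & f_Y(z_1, x'_1, e_1^Y) \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
z_m & x_m & x'_m & e_m^Y & f_Y(z_m, x'_m, e_m^Y)
\end{array}\right) \tag{12}$$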



A simple method to simulate the i.i.d. copy is to apply some random permutation $\pi \in S_m$
to $x_1, \dots, x_m$ and obtain $x_{\pi(1)}, \dots, x_{\pi(m)}$, see below.

We then throw away the two noise columns, i.e., the original noise $E^Y$ and the
additional noise $X'$:

$$\left(\begin{array}{ccc}
Z & X & Y \\
z_1 & x_1 & f_Y(z_1, x'_1, e_1^Y) \\
\vdots & \vdots & \vdots \\
z_m & x_m & f_Y(z_m, x'_m, e_m^Y)
\end{array}\right) \tag{13}$$



To see that this triple is indeed sampled from the desired distribution $P_S(X, Y, Z)$,
we recall that the original structural equation simulates the conditional $P(Y|X, Z)$. After
inserting $X'$ we obtain the new conditional $\sum_{x'} P(Y|x', Z) P(x')$. Multiplying it with
$P(X, Z)$ yields $P_S(X, Y, Z)$, by definition. Using the above samples from $P_S(X, Y, Z)$ and
samples from $P(X, Y, Z)$ we can estimate

$$\mathfrak{C}_{X \to Y} = D\big(P(X, Y, Z) \,\|\, P_S(X, Y, Z)\big)$$

using some known schemes for estimating relative entropies from empirical data. It appears
to be important that the samples from the two distributions are disjoint, meaning that we
need to split the original sample into two halves, one for $P$ and one for $P_S$.
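The scheme of Example 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `plugin_kl_bits` and `estimate_strength` are hypothetical, all variables are assumed to be discrete, and a naive plug-in estimator stands in for the "known schemes" of relative entropy estimation. The sample is split into two disjoint halves, one for $P$ and one for $P_S$.

```python
import math
import random
from collections import Counter

def plugin_kl_bits(samples_p, samples_q):
    """Naive plug-in estimate of D(P||Q) in bits from two lists of tuples."""
    cp, cq = Counter(samples_p), Counter(samples_q)
    n_p, n_q = len(samples_p), len(samples_q)
    total = 0.0
    for key, count in cp.items():
        if cq[key] == 0:
            return math.inf  # the Q sample never shows a value that P produces
        total += (count / n_p) * math.log2((count / n_p) / (cq[key] / n_q))
    return total

def estimate_strength(z, x, e, f_y, rng):
    """Estimate the strength of X -> Y from samples z_j, x_j and inferred noise e_j."""
    m = len(x)
    half = m // 2
    # First half: real observations (z_j, x_j, f_Y(z_j, x_j, e_j)).
    real = [(z[j], x[j], f_y(z[j], x[j], e[j])) for j in range(half)]
    # Second half: Y is fed with a randomly permuted x' instead of x.
    idx = list(range(half, m))
    perm = idx[:]
    rng.shuffle(perm)
    virtual = [(z[j], x[j], f_y(z[j], x[pj], e[j])) for j, pj in zip(idx, perm)]
    return plugin_kl_bits(real, virtual)

# Toy data (assumed for illustration): Z is a fair coin, X copies Z, Y = X xor Z.
rng = random.Random(0)
z = [rng.randint(0, 1) for _ in range(4000)]
x = z[:]
e = [0] * len(z)  # noise-free structural equation
estimate = estimate_strength(z, x, e, lambda zz, xx, ee: xx ^ zz ^ ee, rng)
```

For this toy model the exact value is 1 bit, so `estimate` should land close to 1 for samples of this size; the plug-in estimator is biased for small samples and is only one possible choice.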

We now show that random permutations simulate an i.i.d. copy X' . We first observe: 

Lemma 2. Let X be a discrete random variable with probability mass function $P(x)$.
Given an i.i.d. sample $x_1, \dots, x_m$, let $\pi \in S_m$ be a random permutation. Then the
empirical distribution of $(x_j, x_{\pi(j)})$ converges weakly for $m \to \infty$ to $P(x)P(x')$, where $X'$
is an i.i.d. copy of X.

Proof. For any functions f, g the empirical expectations factorize asymptotically, i.e., for
any fixed $\epsilon > 0$ the probability that

$$\left| \frac{1}{m} \sum_{j=1}^m f(x_j)\, g(x_{\pi(j)}) - \frac{1}{m} \sum_{j=1}^m f(x_j) \cdot \frac{1}{m} \sum_{j=1}^m g(x_j) \right| > \epsilon$$

for a random permutation $\pi$ converges to zero. Hence, the empirical distribution of
$(x, x')$-pairs converges to a product measure, which needs to be $P(x)P(x')$ because we clearly
have weak convergence to $P(x)$ for the marginals. □

For our application, we need a slightly stronger version that ensures that the permuted 
sample is also independent of the other parents of Xi : 

Lemma 3. Let X, W be two random variables with joint density $P(x, w)$. Given an
i.i.d. sample $(x_j, w_j)$ with $j = 1, \dots, m$, let $\pi \in S_m$ be a random permutation. Then the
empirical distribution of the sample $(x_j, w_j, x_{\pi(j)})$ converges weakly to $P(x, w)P(x')$, where
$X'$ is an i.i.d. copy of X.

Proof. Using vector-valued random variables in Lemma 2 we obtain $P(X, W)P(X', W')$
by jointly permuting x, w. Then the statement follows by marginalizing over $W'$. □

Lemma 3 shows that we can indeed generate $X'$ in the way proposed above.

4.3 Properties of causal strength 

This subsection shows that our definition of causal strength satisfies all our postulates. 
In proving this, we observe at the same time some other useful properties. We do not need
to prove P0 because it is implied by P3.

P1: One easily checks $\mathfrak{C}_{X \to Y} = I(X; Y)$ for the 2-node DAG $X \to Y$ (Postulate 1),
because $P_{X \to Y}(x, y) = P(x)P(y)$, and thus

$$D(P \,\|\, P_{X \to Y}) = D\big(P(X, Y) \,\|\, P(X)P(Y)\big) = I(X; Y).$$

P2: Note that the definition of $\mathfrak{C}_{k \to l}$ refers to the impact of the cut on the entire joint
distribution $P(X_1, \dots, X_n)$. It is therefore not obvious that $\mathfrak{C}_{k \to l}$ in fact only depends on the
joint distribution of $X_l$ and its parents, as required by Postulate 2. The following result
shows that this is the case; furthermore it writes causal strength as an expression that is
convenient for practical applications discussed later.

Lemma 4 (causal strength as local relative entropy). Causal strength $\mathfrak{C}_{k \to l}$ can be written
as the following relative entropy distance or conditional relative entropy distance:

$$\mathfrak{C}_{k \to l} = D\big[P(X_l, PA_l) \,\|\, P_S(X_l, PA_l)\big]
= \sum_{pa_l} D\big[P(X_l | pa_l) \,\|\, P_S(X_l | pa_l)\big]\, P(pa_l)
= D\big[P(X_l | PA_l) \,\|\, P_S(X_l | PA_l)\big].$$

Note that $P_S(X_l | pa_l)$ actually depends on the reduced set of parents $PA_l \setminus X_k$ only,
but it is more convenient for the notation and the proof to keep the formal dependence
on all $PA_l$.



Proof. Due to

$$P(X_1, \dots, X_n) = \prod_j P(X_j | PA_j)$$

and

$$P_S(X_1, \dots, X_n) = \prod_j P_S(X_j | PA_j),$$

we have

$$D(P \,\|\, P_S) = \sum_j D\big[P(X_j | PA_j) \,\|\, P_S(X_j | PA_j)\big].$$

For all $j \neq l$ we have $D[P(X_j | PA_j) \,\|\, P_S(X_j | PA_j)] = 0$, because $P(X_l | PA_l)$ is the only
conditional that is modified by the deletion. □

P3: Apart from demonstrating the postulated inequality, the following result shows that
we have the equality $\mathfrak{C}_{X \to Y} = I(X; Y | PA_Y^X)$ for independent causes. Moreover, it provides
an information theoretic interpretation for the additional term that occurs for dependent
causes. To keep notation simple, we have restricted our attention to the case
where Y has only two causes X and Z, but Z can also be interpreted as representing all
parents of Y other than X.

Theorem 5 (decomposition of causal strength). For the DAGs in the accompanying figure we have

$$\mathfrak{C}_{X \to Y} = I(X; Y | Z) + D\big[P(Y|Z) \,\|\, P_{X \to Y}(Y|Z)\big]. \tag{14}$$

If X and Z are independent, the second term vanishes.
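Proof. Due to $P_{X \to Y}(x, y, z) = \sum_{x'} P(y | x', z) P(x') P(x, z)$ we have

$$
\begin{aligned}
D(P \,\|\, P_{X \to Y})
&= \sum_{x,y,z} P(y|x,z) P(x|z) P(z) \log \frac{P(y|x,z)}{\sum_{x'} P(y|x',z) P(x')} \\
&= \sum_{x,y,z} P(y|x,z) P(x|z) P(z) \log \frac{P(y|x,z)}{P(y|z)}
 + \sum_{x,y,z} P(y|x,z) P(x|z) P(z) \log \frac{P(y|z)}{\sum_{x'} P(y|x',z) P(x')} \\
&= I(X; Y | Z) + D\big[P(Y|Z) \,\|\, P_{X \to Y}(Y|Z)\big].
\end{aligned}
$$

To see that the second term vanishes for independent X, Z, we observe $P_{X \to Y}(Y|Z) = P(Y|Z)$
because

$$P_{X \to Y}(y|z) = \sum_x P(y|x,z) P(x) = \sum_x P(y|x,z) P(x|z) = P(y|z).$$

□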
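The decomposition (14) is easy to check numerically. The following sketch is an illustration only: the function name `decomposition_terms` and the randomly drawn binary model are assumptions made here, not part of the paper. It computes the three quantities exactly from $P(x, z)$ and $P(y | x, z)$.

```python
import itertools
import math
import random

def decomposition_terms(p_xz, p_y_given_xz):
    """Return (strength, I(X;Y|Z), correction term) in bits for binary X, Y, Z."""
    vals = (0, 1)
    p_x = {x: sum(p_xz[x, z] for z in vals) for x in vals}
    p_z = {z: sum(p_xz[x, z] for x in vals) for z in vals}
    # Observational conditional P(y|z).
    p_y_z = {(y, z): sum(p_y_given_xz[y, x, z] * p_xz[x, z] for x in vals) / p_z[z]
             for y in vals for z in vals}
    # Post-cutting conditional P_{X->Y}(y|z) = sum_x' P(y|x',z) P(x').
    q_y_z = {(y, z): sum(p_y_given_xz[y, x, z] * p_x[x] for x in vals)
             for y in vals for z in vals}
    strength = cmi = correction = 0.0
    for x, y, z in itertools.product(vals, repeat=3):
        p = p_y_given_xz[y, x, z] * p_xz[x, z]
        if p == 0:
            continue
        strength += p * math.log2(p_y_given_xz[y, x, z] / q_y_z[y, z])
        cmi += p * math.log2(p_y_given_xz[y, x, z] / p_y_z[y, z])
        correction += p * math.log2(p_y_z[y, z] / q_y_z[y, z])
    return strength, cmi, correction

# A random, strictly positive model with dependent causes X and Z.
rng = random.Random(1)
weights = [rng.random() + 0.1 for _ in range(4)]
p_xz = {(x, z): weights[2 * x + z] / sum(weights) for x in (0, 1) for z in (0, 1)}
p_y_given_xz = {}
for x, z in itertools.product((0, 1), repeat=2):
    a = rng.uniform(0.05, 0.95)
    p_y_given_xz[0, x, z], p_y_given_xz[1, x, z] = a, 1.0 - a
strength, cmi, correction = decomposition_terms(p_xz, p_y_given_xz)
```

Up to rounding, `strength` equals `cmi + correction`, and `correction` is non-negative, so the conditional mutual information never exceeds the causal strength in this model.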

Theorem 5 states that conditional mutual information underestimates causal strength.
Assume, for instance, that X and Z are always equal because Z has such a strong influence
on X that X is an exact copy of Z. Then $I(X; Y | Z) = 0$ because knowing Z leaves no
uncertainty about X. To see that causal influence cannot always be zero just because X
and Z coincide, we consider the limiting case where the influence of Z on Y gets weaker
and weaker, while keeping the strong influence on X. Then we obtain $Z \to X \to Y$ as
limiting DAG. We have already seen that $\mathfrak{C}_{X \to Y} = I(X; Y)$ in this case. In other words,
strong dependences between the causes X and Z make the influence of cause X almost
invisible when looking at the conditional mutual information $I(X; Y | Z)$ only. The second
term in (14) corrects for the underestimation. When X depends deterministically on Z,
it is even the only remaining term.

To provide a further interpretation of Theorem 5 we recall that $I(X; Y | Z)$ can be
seen as the impact of ignoring the edge $X \to Y$ (see the earlier remarks on this
interpretation). Then the impact of cutting $X \to Y$ is given by the impact of ignoring this
link plus the impact the cutting has on the conditional $P(Y|Z)$. In the appendix we show
that this result generalizes to cutting and ignoring multiple links.

Finally, we collect together some nice properties of causal influence in the following 
theorem: 

Theorem 6 (relation between strength of sets and single arrows).

The causal influence given in Definition 1 satisfies additivity on targets, locality, and
monotonicity.

a) Additivity regarding targets.

Given a set of arrows S, let $S_i = \{s \in S \mid \mathrm{trg}(s) = X_i\}$; then

$$\mathfrak{C}_S(P) = \sum_i \mathfrak{C}_{S_i}(P).$$

b) Locality.

Every $\mathfrak{C}_{S_i}$ only depends on the conditional $P(X_i | PA_i)$ and the joint distribution of all
parents $P(PA_i)$.

c) Monotonicity.

Given sets of arrows $S_1 \subset S_2$ targeting a single node Z, such that the source nodes in $S_2$
are jointly independent. Then we have

$$\mathfrak{C}_{S_1}(P) \le \mathfrak{C}_{S_2}(P).$$

Proof. See Appendix A.2. □

The intuitive meaning of these properties is as follows. Part (a) says that causal
influence is additive if the arrows have different targets. Otherwise, we can still decompose
the set S into equivalence classes of arrows having the same target and obtain additivity
regarding the decomposition. We will show in Subsection 4.4 that general additivity fails.
Part (b) is an analog of P2 for multiple arrows. According to (c), the strength of a subset
of arrows cannot be larger than the strength of its superset, provided that there are no
dependences among the parent nodes.

4.4 Examples and paradoxes 

Although Theorem [6] shows that causal influence behaves nicely in many situations, there 
remain some examples where the results are somewhat counterintuitive. We collect these 
here. 

Failure of subadditivity: The strength of a set of arrows is not bounded from above by
the sum of the strengths of the single arrows. It can even happen that removing one arrow
from a set has no impact on the joint distribution while removing all of them has significant
impact:

Figure 6: Causal structure of an error-correcting scheme: the encoder generates 2k + 1 bits
from a single one. The decoder decodes the 2k + 1 bit words into a single bit again.

Example 2 (error correcting code). Let E and D be binary variables that we call "encoder"
and "decoder" (see Figure 6), communicating over a channel that consists of the bits
$B_1, \dots, B_{2k+1}$. Using the simple repetition code, all $B_j$ are just copies of E. Then D is
set to the logical value that is attained by the majority of the $B_j$. This way, k errors can be
corrected, i.e., removing k or fewer of the links $B_j \to D$ has no effect on the joint
distribution, i.e., $P_S = P$ for $S := \{B_1 \to D, B_2 \to D, \dots, B_k \to D\}$, hence $\mathfrak{C}_S(P) = 0$. In words:
removing k or fewer arrows is without impact, but removing all of them is, of course. After
all, the arrows jointly generate the dependence $I(E : D) = I(E : B_1, \dots, B_k, D) = 1$.

Clearly, the outputs of E causally influence the behavior of D. We therefore need to
consider interventions that destroy many arrows at once if we want to capture the fact
that their joint influence is non-zero.

Thus, causal influence of arrows is not subadditive: the strength of each arrow $B_j \to D$
is zero, but the strength of the set of all $B_j \to D$ is 1 bit.
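The repetition-code example can be verified by brute force. The sketch below is illustrative only: the function name `strength_of_cut` is hypothetical, E is assumed to be a fair coin, and each cut input of D is replaced by an independent fair bit (the product of the marginals of the $B_j$). Since $P$ and $P_S$ differ only in the conditional of D given its parents, it suffices to compare these conditionals on the support of $P$.

```python
import itertools
import math

def strength_of_cut(n_bits, cut):
    """D(P || P_S) in bits when the arrows B_j -> D with j in `cut` are removed."""
    total = 0.0
    for e in (0, 1):                       # E is a fair coin; every B_j copies E
        b = [e] * n_bits
        # Probability that D still outputs e when the cut inputs are replaced
        # by independent fair bits.
        hits = 0
        for fresh in itertools.product((0, 1), repeat=len(cut)):
            fed = list(b)
            for j, value in zip(cut, fresh):
                fed[j] = value
            majority_is_one = sum(fed) * 2 > n_bits
            hits += int(majority_is_one == bool(e))
        q = hits / 2 ** len(cut)
        total += 0.5 * math.log2(1.0 / q)  # P(d | b) = 1 on the support of P
    return total

k = 2
n_bits = 2 * k + 1
single = [strength_of_cut(n_bits, [j]) for j in range(n_bits)]
up_to_k = strength_of_cut(n_bits, list(range(k)))
all_arrows = strength_of_cut(n_bits, list(range(n_bits)))
```

With these assumptions every entry of `single` and the value `up_to_k` are zero, while `all_arrows` equals 1 bit, which is the failure of subadditivity described above.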

Failure of superadditivity: The following example reveals an opposing phenomenon,
where the causal strength of a set is smaller than the sum of the single arrows:

Example 3 (XOR-gate with COPY).

The causal influence of each arrow targeting an XOR-gate individually is the same as the
causal influence of both arrows taken together:

$$\mathfrak{C}_{X \to Z}(P) = \mathfrak{C}_{Y \to Z}(P) = \mathfrak{C}_{X \to Z,\, Y \to Z}(P) = 1 \text{ bit}.$$

Strong influence without dependence: Revisiting our XOR-example is also instruc- 
tive because it demonstrates an extreme case of confounding where I{X;Y\Z) vanishes 
but causal influence is strong. 

Figure 7: Broadcasting one bit from one node to multiple nodes.

Example 4 (XOR-gate with copy).

Consider DAG a) in the corresponding figure and let the relation between X, Y, Z be given
by the structure equations

$$X = Z, \qquad Y = X \oplus Z.$$

Removing $X \to Y$ yields

$$P_{X \to Y}(x, y, z) = P(x)\, P_{X \to Y}(y)\, P(z | x),$$

where $P(x) = P_{X \to Y}(y) = 1/2$ and $P(z | x) = \delta_{z,x}$. It is easy to see that

$$D(P \,\|\, P_{X \to Y}) = 1,$$

because P is a uniform distribution over 2 possible triples $(x, y, z)$, whereas $P_{X \to Y}$ is a
uniform distribution over 4 combinations.

The impact of cutting the edge $X \to Y$ is remarkable: both distributions, the observed
one P as well as the post-cutting distribution $P_S$, factorize as $P_S(X, Y, Z) = P_S(X, Z) P_S(Y)$
and $P(X, Y, Z) = P(X, Z) P(Y)$. Cutting the edge keeps the product structure and only
changes the marginal distribution of Y.
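The value of 1 bit in the XOR-with-copy example can be reproduced by enumerating both distributions. This is a small sketch under the assumption that Z is a fair coin; the variable names are chosen here for illustration.

```python
import math

# Observed distribution P: Z is a fair coin, X = Z and Y = X xor Z (so Y = 0).
p = {}
for z in (0, 1):
    x = z
    p[(x, x ^ z, z)] = 0.5

# Post-cutting distribution: Y is computed from an independent fair copy x' of X.
q = {}
for z in (0, 1):
    x = z
    for x_prime in (0, 1):
        key = (x, x_prime ^ z, z)
        q[key] = q.get(key, 0.0) + 0.5 * 0.5

strength_bits = sum(pv * math.log2(pv / q[key]) for key, pv in p.items())
```

Here `p` has two equally likely triples and `q` has four, so `strength_bits` evaluates to 1, although Y is constant under P and therefore $I(X; Y | Z) = 0$.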

Strong effect of little information: The following example considers multiple arrows 
and shows that their joint strength may even be strong when they carry the same small 
amount of information: 

Example 5 (broadcasting).
Consider a single source X with many targets $Y_1, \dots, Y_n$ such that each $Y_i$ copies X, see
Figure 7. Assume $P(X = 0) = P(X = 1) = 1/2$. If S is the set of all arrows $X \to Y_j$ then
$\mathfrak{C}_S = n$. Thus, the single node X exerts n bits of causal influence on its dependents.
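As a sanity check of the broadcasting example, the following sketch enumerates both distributions for a given number n of targets. The function name `broadcast_strength` is hypothetical, and X is assumed to be a fair coin.

```python
import itertools
import math

def broadcast_strength(n):
    """D(P || P_S) in bits when all n arrows X -> Y_j are cut."""
    # P: X is a fair coin and every Y_j copies X (two configurations).
    p = {(x,) + (x,) * n: 0.5 for x in (0, 1)}
    # P_S: every Y_j receives its own independent fair copy of X,
    # so all 2^(n+1) configurations of (X, Y_1, ..., Y_n) are equally likely.
    q = {cfg: 0.5 ** (n + 1) for cfg in itertools.product((0, 1), repeat=n + 1)}
    return sum(pv * math.log2(pv / q[cfg]) for cfg, pv in p.items())

strengths = [broadcast_strength(n) for n in range(1, 6)]
```

The list `strengths` grows linearly with n, in line with the statement that a single bit broadcast to n targets exerts n bits of causal influence.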

5 Causal influence between two time series 

5.1 Definition 

Since causal analysis of time series is of high practical importance, we devote a section
to this case. For some fixed t, we introduce the short notation $X \to Y_t$ for the set of all
arrows that point to $Y_t$ from some $X_s$ with $s < t$. Then

$$\mathfrak{C}_{X \to Y_t}$$

measures the impact of deleting all these arrows. We propose to replace transfer entropy
with this measure since it does not suffer from the drawbacks described in Subsection 3.2.

Subsection 4.2 describes how to estimate causal strength from finite data for one
arrow and briefly mentions how this generalizes to sets of arrows. To keep this section
self-contained, we briefly rephrase the description for the case of time series.

Suppose we have learned the structure equation model

$$Y_t = f_t(X_{t-1}, X_{t-2}, \dots, X_{t-p}, E_t),$$

from observed data $(x_t, y_t)_{t \le 0}$, where the noise variables $E_t$ are jointly independent and
independent of $X_t, X_{t-1}, \dots, Y_{t-1}, Y_{t-2}, \dots$. Assume, moreover, that we have inferred the
corresponding values $(e_t)_{t \le 0}$ of the noise.

To generate virtual observations via permutations we either need k samples for each
time instant t, or the time series must be approximately stationary over a sufficiently
large time interval. In the first case, we choose random permutations $\pi_1, \dots, \pi_p \in S_k$ that
permute the k samples within each time instant t. In the second case, we choose some
number $m \gg p$ and formally treat $(x_{s-jm})_{j=1,2,\dots}$ as a sample, on which the permutations
act. The permutations generate virtual observations

$$\tilde{x}_{t-1}, \dots, \tilde{x}_{t-p},$$

from which we compute the virtual $Y_t$ values

$$\tilde{y}_t := f_t(\tilde{x}_{t-1}, \dots, \tilde{x}_{t-p}, e_t).$$

Then we estimate the relative entropy distance between the joint distribution of

$$Y_t, X_{t-1}, \dots, X_{t-p}$$

given by the real observations $y_t, x_{t-1}, \dots, x_{t-p}$ and the virtual ones $\tilde{y}_t, x_{t-1}, \dots, x_{t-p}$.

5.2 Comparison of causal influence with transfer entropy 

We first recall the example given by [10] showing a problem with transfer entropy
(Subsection 3.2). Assume that the variables $X_t, Y_t$ in the corresponding figure (right) are binary
and the transition from $X_{t-1}$ to $Y_t$ is a perfect copy and likewise the transition from $Y_{t-1}$ to $X_t$.
Assume, moreover, that the system has been initialized such that, with probability 1/2, all
variables are 1 and with probability 1/2 all are zero. Then the set $X \to Y_t$ is the singleton
$S := \{X_{t-1} \to Y_t\}$. Using Lemma 4 we have

$$\mathfrak{C}_{X_{t-1} \to Y_t} = D\big[P(Y_t, X_{t-1}) \,\|\, P_S(Y_t, X_{t-1})\big].$$

Since $Y_t$ is a perfect copy of $X_{t-1}$, we have

$$P(y_t, x_{t-1}) = \begin{cases} 1/2 & \text{for } x_{t-1} = y_t \\ 0 & \text{otherwise.} \end{cases}$$

The deletion turns this into

$$P_S(y_t, x_{t-1}) = 1/4 \quad \text{for } (y_t, x_{t-1}) \in \{0, 1\}^2.$$

One easily checks $D(P \,\|\, P_S) = 1$.

Note that the example is somewhat unfair, since it is impossible to distinguish the 
structure equations from the case where Xt+i is the opposite of Xt and similarly for Y , 
no matter how many observations are performed. Thus, from observing the system it is 
impossible to tell whether or not X is exerting an influence on Y. However, the following
modification shows that transfer entropy still goes quantitatively wrong if small errors are
introduced:

Example 6 (perturbed transfer entropy counterexample).

Perturb Ay and Polani's example by having $X_t$ copy $Y_{t-1}$ correctly with probability
$p = 1 - \epsilon$. Set node $X_t$'s transitions as the Markov matrix

$$
\begin{array}{c|cc}
 & x_t = 0 & x_t = 1 \\ \hline
y_{t-1} = 0 & 1 - \epsilon & \epsilon \\
y_{t-1} = 1 & \epsilon & 1 - \epsilon
\end{array}
$$

and similarly for the transition from $X_{t-1}$ to $Y_t$.
The transfer entropy from X to Y at time t is

$$TE := I\big(X_{(-\infty, t-1]}; Y_t \,\big|\, Y_{(-\infty, t-1]}\big) = I(X_{t-1}; Y_t \,|\, Y_{t-2}),$$

where we have used

$$Y_t \perp X_{t-2}, X_{t-3}, \dots, Y_{t-1}, Y_{t-2}, \dots \mid X_{t-1}.$$

Some calculations show

$$TE = (1 - \epsilon)^2 \log \frac{1 - \epsilon}{1 - 2\epsilon + 2\epsilon^2}
 + \epsilon^2 \log \frac{\epsilon}{1 - 2\epsilon + 2\epsilon^2}
 + \epsilon (1 - \epsilon) \log \frac{1}{4 \epsilon (1 - \epsilon)},$$

which tends to zero for $\epsilon \to 0$, in agreement with the unperturbed example. Causal
influence, on the other hand, reads

$$\mathfrak{C} = (1 - \epsilon) \log(2 - 2\epsilon) + \epsilon \log(2\epsilon),$$

which tends to 1 for $\epsilon \to 0$. Hence, causal influence detects the causal interactions between
X and Y based on empirical data, whereas transfer entropy does not. Thanks to the
perturbation, the joint distribution tells us the kind of causal relations by which it is generated.
For large enough samples, the strong discrepancy between transfer entropy and our causal
strength thus becomes apparent.
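The two closed-form expressions of Example 6 can be compared numerically. The sketch below is an illustration, not the authors' code: the function names are hypothetical, logarithms are taken to base 2, and the chain $Y_{t-2} \to X_{t-1} \to Y_t$ is assumed to be stationary with a uniform $Y_{t-2}$. The transfer entropy is computed by direct enumeration and checked against the closed form.

```python
import itertools
import math

def transfer_entropy(eps):
    """I(X_{t-1}; Y_t | Y_{t-2}) in bits, by enumerating Y_{t-2} -> X_{t-1} -> Y_t."""
    def step(a, b):
        # Probability of moving from a to b through a noisy copy with error eps.
        return 1.0 - eps if a == b else eps
    joint = {(w, x, y): 0.5 * step(w, x) * step(x, y)
             for w, x, y in itertools.product((0, 1), repeat=3)}
    p_wy = {}
    for (w, x, y), v in joint.items():
        p_wy[w, y] = p_wy.get((w, y), 0.0) + v
    # P(y | x, w) = P(y | x) by the Markov property; P(y | w) = P(w, y) / P(w).
    return sum(v * math.log2(step(x, y) / (p_wy[w, y] / 0.5))
               for (w, x, y), v in joint.items() if v > 0)

def transfer_entropy_closed_form(eps):
    s = 1 - 2 * eps + 2 * eps ** 2
    return ((1 - eps) ** 2 * math.log2((1 - eps) / s)
            + eps ** 2 * math.log2(eps / s)
            + eps * (1 - eps) * math.log2(1 / (4 * eps * (1 - eps))))

def causal_strength(eps):
    return (1 - eps) * math.log2(2 - 2 * eps) + eps * math.log2(2 * eps)

table = [(eps, transfer_entropy(eps), causal_strength(eps))
         for eps in (0.2, 0.1, 0.01, 0.001)]
```

Going down the rows of `table`, the transfer entropy shrinks towards zero while the causal strength approaches 1 bit.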



A Appendix: Further properties of causal strength 

A.1 Decomposition into conditional relative entropies

The following result generalizes Lemma 4 to the case where S contains more than one edge.
It shows that causal strength, defined via the relative entropy between two distributions
on the entire DAG, decomposes into a sum of conditional relative entropies, each referring
to the conditional distribution of one of the target nodes given its parents:

Lemma 7 (causal influence decomposes into a sum of expectations).
The causal influence of a set of arrows S can be rewritten as

$$\mathfrak{C}_S(P) = \sum_{j \in \mathrm{trg}(S)} D\Big[ P(X_j | PA_j) \,\Big\|\, \sum_{pa_j^S} P(X_j | PA_j^{\bar{S}}, pa_j^S)\, P_\Pi(pa_j^S) \Big],$$

where $\mathrm{trg}(S)$ denotes the target nodes of arrows in S.

An intuitive implication of this result is Theorem 6 in the main text, whose proof is
given in the next section.

Proof. By definition, $\mathfrak{C}_S(P) = D(P \,\|\, P_S)$ and we obtain

$$
\begin{aligned}
\mathfrak{C}_S(P)
&= D\Big[ \prod_{j=1}^n P(X_j | PA_j) \,\Big\|\, \prod_{j=1}^n \sum_{pa_j^S} P(X_j | PA_j^{\bar{S}}, pa_j^S)\, P_\Pi(pa_j^S) \Big] && (15) \\
&= D\Big[ \prod_{j \in \mathrm{trg}(S)} P(X_j | PA_j) \,\Big\|\, \prod_{j \in \mathrm{trg}(S)} \sum_{pa_j^S} P(X_j | PA_j^{\bar{S}}, pa_j^S)\, P_\Pi(pa_j^S) \Big] && (16) \\
&= \sum_{j \in \mathrm{trg}(S)} D\Big[ P(X_j | PA_j) \,\Big\|\, \sum_{pa_j^S} P(X_j | PA_j^{\bar{S}}, pa_j^S)\, P_\Pi(pa_j^S) \Big]. && (17)
\end{aligned}
$$

(15) = (16): Only the distributions of elements targeted by arrows in S are affected by
the marginalization and therefore contribute to the causal influence; others play no role
and cancel out in the logarithms. Nodes in $PA^{\bar{S}}$ cancel out of the logarithm for the same
reason. However, what remains inside the logarithm is a function of these values, hence
the expectation over these nodes is nontrivial.
(16) = (17): Products go to sums by definition of relative entropy. □



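A.2 Proof of Theorem 6

Proof. Parts (a) and (b) follow from Lemma 7 since $\mathfrak{C}_{S_i}(P)$ is the i-th summand in (17),
which obviously depends on $P(X_i | PA_i)$ and $P(PA_i)$ only. To prove part (c), start with
the special case where $G = \{X \to Z, Y \to Z\}$ and the sets are $S_1 = \{Y \to Z\}$ and
$S_2 = \{X \to Z, Y \to Z\}$. Then

$$
\begin{aligned}
\mathfrak{C}_{S_2}(P)
&= \sum_{x,y} P(x) P(y)\, D\Big[ P(Z | x, y) \,\Big\|\, \sum_{x',y'} P(Z | x', y') P(x') P(y') \Big] \\
&= \sum_{x,y} P(x) P(y)\, D\Big[ P(Z | x, y) \,\Big\|\, \sum_{y'} P(Z | x, y') P(y') \Big] \\
&\quad + \sum_{x,y,z} P(x) P(y) P(z | x, y) \log \frac{\sum_{y'} P(z | x, y') P(y')}{\sum_{x',y'} P(z | x', y') P(x') P(y')} \\
&= \mathfrak{C}_{S_1}(P) + \sum_x P(x)\, D\Big[ \sum_y P(Z | x, y) P(y) \,\Big\|\, \sum_{x',y'} P(Z | x', y') P(x') P(y') \Big].
\end{aligned}
$$

In this case the proposition follows since $D(R \,\|\, Q) \ge 0$ for any R and Q.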
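In the general case, the independence of the parents of Z implies

$$P(z, pa_Z) = P(z | pa_Z) \cdot P_\Pi(pa_Z^{S_2}) \cdot P(pa_Z^{\bar{S}_2} | pa_Z^{S_2}).$$

It follows that

$$
\begin{aligned}
\mathfrak{C}_{S_2}(P)
&= \sum_{pa_Z} P_\Pi(pa_Z^{S_2}) \cdot P(pa_Z^{\bar{S}_2} | pa_Z^{S_2}) \cdot D\big( P(Z | pa_Z) \,\big\|\, P_{S_2}(Z | pa_Z^{\bar{S}_2}) \big) \\
&= \sum_{pa_Z} P_\Pi(pa_Z^{S_2}) \cdot P(pa_Z^{\bar{S}_2} | pa_Z^{S_2}) \cdot D\big( P(Z | pa_Z) \,\big\|\, P_{S_1}(Z | pa_Z^{\bar{S}_1}) \big) \\
&\quad + \sum_{z,\, pa_Z} P_\Pi(pa_Z^{S_2}) \cdot P(pa_Z^{\bar{S}_2} | pa_Z^{S_2}) \cdot P(z | pa_Z) \cdot \log \frac{P_{S_1}(z | pa_Z^{\bar{S}_1})}{P_{S_2}(z | pa_Z^{\bar{S}_2})}.
\end{aligned}
$$

The coefficient of the logarithm, $\sum P(z | pa_Z) \cdot P_\Pi(pa_Z^{S_2}) \cdot P(pa_Z^{\bar{S}_2} | pa_Z^{S_2})$, can be written as a
sum of terms of the form $P_{S_1}(z | pa_Z^{\bar{S}_1}) = \sum_{pa_Z^{S_1}} P(z | pa_Z) P_\Pi(pa_Z^{S_1})$ since we have assumed
that the sources are independent. Consequently,

$$\mathfrak{C}_{S_2}(P) = \mathfrak{C}_{S_1}(P) + \sum P_\Pi(pa_Z^{S_2 \setminus S_1}) \cdot P(pa_Z^{\bar{S}_2} | pa_Z^{S_2 \setminus S_1}) \cdot D\big[ P_{S_1}(Z | pa_Z^{\bar{S}_1}) \,\big\|\, P_{S_2}(Z | pa_Z^{\bar{S}_2}) \big]$$

and the proposition follows because relative entropy is non-negative. □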

A.3 Causal influence measures controllability

Causal influence is intimately related to control. Suppose an experimenter wishes to
understand interactions between components of a complex system. She is able to observe
nodes Y and Z, and manipulate node X. To what extent can she control node Y? The
notion of control has been formalized information-theoretically in [14]:

Definition 5 (perfect control).

Node Y is perfectly controllable by node X at Z = z if, given z,

i) states of Y are a deterministic function of states of X; and

ii) manipulating X gives rise to all states of Y.

Perfect control can be elegantly characterized:

Theorem 8 (information-theoretic characterization of perfect controllability).

A node Y with inputs X and Z is perfectly controllable by X alone for Z = z iff there
exists a Markov transition matrix $R(x|z)$ such that

$$H(Y | z, do\,X) := \sum_x R(x|z)\, H(Y | z, do\,x) = 0, \quad \text{and} \tag{C1}$$

$$\sum_{x \in X} P(y | z, do\,x)\, R(x|z) \neq 0 \quad \text{for all } y. \tag{C2}$$

Here, $H(Y | z, do\,x)$ denotes the conditional Shannon entropy of Y, given that Z = z has
been observed and X has been set to x.

Proof. The theorem restates the criteria in the definition. For a proof, see [14]. □

It is instructive to compare Theorem 8 to our measure of causal influence. The theorem
highlights two fundamental properties of perfect control. First, (C1), perfect control
requires there is no variation in Y's behavior given the choice of x. Second, (C2), perfect
control requires that all potential outputs of Y can be induced by manipulating node X.
This suggests a measure of the degree of control should reflect (i) the variability in Y's
behavior that cannot be eliminated by imposing X values and (ii) the size of the repertoire
of behaviors that can be induced on the target by manipulating a source.
If X and Z are independent then by Theorem 5

$$\mathfrak{C}_{X \to Y}(P) = I(X; Y | Z) = H(Y | Z) - H(Y | X, Z).$$

The first term, $H(Y | Z)$, quantifies the size of the repertoire of outputs of Y averaged over
inputs from Z. It corresponds to requirement (C2) in the characterization of perfect control:
that $\sum_x P(y | z, do\,x) R(x|z) > 0$ for all y. Specifically, the causal influence, interpreted
as a measure of the degree of controllability, increases with the size of the (weighted)
repertoire of outputs that can be induced by manipulations.

The second term, $H(Y | X, Z)$ (which coincides with $H(Y | Z, do\,X)$ here), quantifies
the variability in Y's behavior that cannot be eliminated by controlling X. It corresponds
to requirement (C1) in the characterization of perfect control: that remaining
variability should be zero. Causal influence increases as the variability
$H(Y | Z, do\,X) = \sum_z P(z) H(Y | z, do\,X)$ tends towards zero.

A.4 Causal strength majorizes observed dependence

Recalling that $P(X_1, \dots, X_n)$ factorizes into $\prod_j P(X_j | PA_j)$ with respect to the true
causal DAG G, one may ask how much error would arise if one was not aware of all
causal influences and erroneously worked with a DAG where interactions across some set
of arrows S in the true DAG G are hidden. The conditionals with respect to the reduced
set of parents define a different joint distribution:

Definition 6 (partially observed distribution).
Given distribution P, Markovian with respect to G, and set of arrows S, let the partially
observed distribution (where interactions across S are hidden) for node $X_j$ be

$$\tilde{P}_S(x_j | pa_j^{\bar{S}}) = \sum_{pa_j^S} P(x_j | pa_j^{\bar{S}}, pa_j^S)\, P(pa_j^S | pa_j^{\bar{S}}).$$

Let the partially observed distribution for all the nodes be the product

$$\tilde{P}_S(x_1, \dots, x_n) = \prod_j \tilde{P}_S(x_j | pa_j^{\bar{S}}).$$

Remark 4. Intuitively, the observed influence of a set of arrows should be quantified by
comparing the data available to an observer who can see the entire DAG with the data
available to an observer who sees all the nodes of the graph, but only some of the arrows.
Definition 6 formalizes "seeing only some of the arrows".

The definition of the observed dependence of a set of arrows takes the same general
form as for causal influence. However, instead of inserting noise on the arrows, we
simply prevent ourselves from seeing them:

Definition 7 (observed influence).

Given distribution P Markovian with respect to G and set of arrows S, let the observed
influence of the arrows in S be

$$D_S(P) := D(P \,\|\, \tilde{P}_S).$$

The following result generalizes Theorem 5.

Theorem 9 (causal influence majorizes observed dependence).

Causal influence decomposes into observed influence plus a non-negative term quantifying
the divergence between the partially observed and interventional distributions:

$$\mathfrak{C}_S(P) = D_S(P) + D(\tilde{P}_S \,\|\, P_S).$$

The theorem shows that "snapping upstream dependencies" by using purely local data
- i.e. by marginalizing using the distribution of the source node $P(X_i)$ rather than the
conditional $P(X_i | PA_i)$ - is essential to quantifying causal influence.

Proof. Expand $\mathfrak{C}_S(P)$ as

$$
\begin{aligned}
D(P \,\|\, P_S)
&= \sum P(x_1 \dots x_n) \log \frac{P(x_1 \dots x_n)}{P_S(x_1 \dots x_n)} \\
&= \sum P(x_1 \dots x_n) \log \frac{P(x_1 \dots x_n)}{\tilde{P}_S(x_1 \dots x_n)}
 + \sum P(x_1 \dots x_n) \log \frac{\tilde{P}_S(x_1 \dots x_n)}{P_S(x_1 \dots x_n)}.
\end{aligned}
$$

By the proof of Lemma 7, the second term can be written as

$$\sum_j D\big( \tilde{P}_S(X_j | PA_j^{\bar{S}}) \,\big\|\, P_S(X_j | PA_j^{\bar{S}}) \big)
= \sum_{j \in \mathrm{trg}(S)} D\big( \tilde{P}_S(X_j | PA_j^{\bar{S}}) \,\big\|\, P_S(X_j | PA_j^{\bar{S}}) \big)
= D(\tilde{P}_S \,\|\, P_S),$$

where the last equality also follows by the proof of Lemma 7.

Causal influence is thus observed influence plus a correction term that quantifies the
divergence between the partially observed and interventional distributions. The correction
term is non-negative since it is a Kullback-Leibler divergence. □

B Another option to define causal strength 

We now discuss a slightly different approach to defining the strength of an arrow. Although
it has many nice properties and is quite intuitive, it fails to satisfy Postulate 3. To define
the strength of the arrow $X \to Y$ we consider Y and its parents $PA_Y$ and define a
modified joint distribution on Y and $PA_Y$ by

$$P'(Y, PA_Y) := P(X)\, P(PA_Y^X)\, P(Y | PA_Y).$$

In words: we remove the dependences between X and the other parents of Y. Then
we define causal strength by the conditional mutual information

$$I_{P'}(X; Y | PA_Y^X)$$

with respect to the modified distribution. The modification can be thought of as describing
the post-interventional distribution where X is set to the values x according to the
observed marginal distribution $P(X)$. To show that Postulate 3 is violated, we consider
the case where two dependent variables X and Z influence Y. Let X consist of k + 1
bits, Y be k bits and Z be just one bit. Call the first bit $X_1$ of X the control bit and the
remaining k bits the message. Define $P(Y | X, Z)$ such that the message bits are copied to
Y whenever both the control bit of X and the variable Z are set to 1. Otherwise, the bits of Y
are uniformly distributed on $\{0, 1\}^k$. To specify $P(X, Z)$, we first define a distribution on
$(X_1, Z)$ by

$$P(x_1, z) = \begin{cases} 1/2 & \text{for } x_1 = z \\ 0 & \text{otherwise.} \end{cases}$$

Then we set

$$P(X, Z) := P(X_1, Z)\, P(X_2, \dots, X_{k+1}),$$

where $P(X_2, \dots, X_{k+1})$ is the uniform distribution on $\{0, 1\}^k$. It is easy to see that

$$I(X; Y | Z) = k/2$$

because the k message bits are copied whenever $Z = X_1 = 1$, which happens with
probability 1/2. However, modifying P to $P'$ breaks the coupling between $X_1$ and Z, and
$X_1 = Z = 1$ only happens with probability 1/4. Thus, the message bits are only copied
with probability 1/4 and therefore

$$I_{P'}(X; Y | Z) = k/4 < I(X; Y | Z),$$

while Postulate 3 requires

$$I_{P'}(X; Y | Z) \ge I(X; Y | Z).$$
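The values k/2 and k/4 can be confirmed by exact enumeration for small k. The following sketch is illustrative: the helper names `cond_mi_bits` and `message_model` are hypothetical, and the control bit is taken to be the first component of X.

```python
import itertools
import math

def cond_mi_bits(joint):
    """I(X;Y|Z) in bits for a joint distribution given as {(x, y, z): prob}."""
    p_z, p_xz, p_yz = {}, {}, {}
    for (x, y, z), v in joint.items():
        p_z[z] = p_z.get(z, 0.0) + v
        p_xz[x, z] = p_xz.get((x, z), 0.0) + v
        p_yz[y, z] = p_yz.get((y, z), 0.0) + v
    return sum(v * math.log2(v * p_z[z] / (p_xz[x, z] * p_yz[y, z]))
               for (x, y, z), v in joint.items() if v > 0)

def message_model(k, p_control_z):
    """Joint of (X, Y, Z) with X = (control bit, k message bits) and k-bit Y."""
    joint = {}
    words = list(itertools.product((0, 1), repeat=k))
    for (c, z), w in p_control_z.items():
        for msg in words:
            for y in words:
                if c == 1 and z == 1:
                    p_y = 1.0 if y == msg else 0.0   # the message is copied to Y
                else:
                    p_y = 2.0 ** -k                  # Y is uniform noise
                if w * p_y > 0:
                    joint[((c,) + msg, y, z)] = w * 2.0 ** -k * p_y
    return joint

k = 2
coupled = {(0, 0): 0.5, (1, 1): 0.5}                         # observed P(X_1, Z)
decoupled = {(c, z): 0.25 for c in (0, 1) for z in (0, 1)}   # P': X_1 independent of Z
i_observed = cond_mi_bits(message_model(k, coupled))
i_modified = cond_mi_bits(message_model(k, decoupled))
```

For k = 2 this gives `i_observed` = 1 bit and `i_modified` = 0.5 bit, so the alternative definition falls below the conditional mutual information, as argued above.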

C The problem of defining total influence

If X influences Y via directed paths other than a direct arrow, we may want to measure
the total influence of X on Y.

However, we cannot quantify total influence by quantifying the impact of removing all
the arrows on the directed paths connecting X and Y. To see this, consider the causal
chain $X \to Z \to Y$ and assume that X and Z are strongly coupled (e.g. by a copy
operation), but Z and Y are weakly coupled (e.g. a very noisy copy operation). Then,
removing both arrows $X \to Z$ and $Z \to Y$ has a large impact on $P(X, Z, Y)$ although
Y obtains almost no signal from X. In this simple case, total influence may be defined
by replacing the path $X \to Z \to Y$ with a single arrow and computing the direct influence
in $X \to Y$ after marginalizing $P(X, Z, Y)$ to $P(X, Y)$. However, we do not see how to
construct a general rule for shrinking total influence to a direct arrow. To describe the
problem, consider DAG e) in the corresponding figure. Since X influences Y only directly,
we may want to consider $\mathfrak{C}_{X \to Y} = I(X; Y | Z)$ also as the strength of the total influence.
On the other hand, for case f), we would tend to consider $I(X; Y)$ as the total influence.
This is because shrinking the DAG to a direct influence would represent the net effect of
both influences, the direct one $X \to Y$ and the indirect one $X \to Z \to Y$, by a single arrow
$X \to Y$. The distribution $P(X, Y)$ is simply given by marginalizing $P(X, Z, Y)$. On the
other hand, DAG e) can be considered as a special instance of case f) by adding an irrelevant
arrow to e). Inserting the "virtual edge" then changes the total influence from $I(X; Y | Z)$ to
$I(X; Y)$. If, for instance, $Y = X \oplus Z$ and X and Z are unbiased coins, $I(X; Y) = 0$ but
$I(X; Y | Z) = 1$.
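The two numbers in the XOR example can be checked directly. This is a minimal sketch with a hypothetical helper `mutual_info_bits`; X and Z are independent fair coins and $Y = X \oplus Z$.

```python
import itertools
import math

def mutual_info_bits(joint):
    """I(A;B) in bits for a joint distribution given as {(a, b): prob}."""
    p_a, p_b = {}, {}
    for (a, b), v in joint.items():
        p_a[a] = p_a.get(a, 0.0) + v
        p_b[b] = p_b.get(b, 0.0) + v
    return sum(v * math.log2(v / (p_a[a] * p_b[b]))
               for (a, b), v in joint.items() if v > 0)

# Joint distribution of (X, Y, Z) with Y = X xor Z.
joint = {(x, x ^ z, z): 0.25 for x, z in itertools.product((0, 1), repeat=2)}

# I(X;Y): marginalize over Z.
p_xy = {}
for (x, y, z), v in joint.items():
    p_xy[x, y] = p_xy.get((x, y), 0.0) + v
i_xy = mutual_info_bits(p_xy)

# I(X;Y|Z) = sum_z P(z) * I(X;Y | Z = z), with P(z) = 1/2.
i_xy_given_z = 0.0
for z0 in (0, 1):
    conditional = {(x, y): v / 0.5 for (x, y, z), v in joint.items() if z == z0}
    i_xy_given_z += 0.5 * mutual_info_bits(conditional)
```

The result is `i_xy` = 0 and `i_xy_given_z` = 1 bit, which is why treating DAG e) as a special case of f) changes the candidate total influence from 1 bit to zero.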